Because speech perception is vitally important, the visual system may possess exceptional sensitivity to speech-related signals conveyed by lip movements, even without awareness of the speaking face. We tested this possibility using continuous flash suppression (CFS; e.g., Tsuchiya & Koch, 2005), a technique in which a critical stimulus, here a speaking face, is presented to one eye while strong dynamic-masking noise is presented to the other eye (Figure 1A), rendering the speaking face invisible. We determined whether the visual system could still encode the spatiotemporal patterns of lip movements when observers were aware of only the randomly flashing masking display.
A. Trial sequence of a masked-face (face invisible) trial. Each trial began with a fixation cross (0.36° × 0.36°) lasting 3000 ms. While the face was presented to one eye, a strong dynamic mask, consisting of a random array of ...
We measured the encoding of invisible lip movements as crossmodal facilitation of spoken word categorization (e.g., Sumby & Pollack, 1954). Participants determined whether each spoken word was a target word (a tool name) or a non-target word (a name of a non-tool object) while they concurrently viewed a face that either spoke the same word—the congruent condition—or a different word—the incongruent condition.
Prior research suggests that spatial attention influences unaware as well as aware visual processing (e.g., Cohen et al., 2012), and that attention to the mouth region is necessary for lip movements to facilitate spoken word perception (e.g., Alsius et al., 2005; Driver & Spence, 1994). To help direct attention to the mouth region, on half of the trials, we presented the face without the dynamic mask (Figure 1B), and we instructed participants to localize a small probe briefly presented near the mouth (Supplementary Figure S1). These “attention-enforcement” trials were randomly intermixed with the critical masked-face (face invisible) trials. To further enforce attention to the mouth region on the masked-face trials, the probe also appeared on the masked face, and participants were instructed to report its location whenever the face became visible through the mask. Participants reported seeing the face on 5% of the masked-face trials, and the data from those trials were removed from the analyses. If the visual system automatically extracts lip movements even when they are invisible, spoken-word categorization should be facilitated by congruent lip movements even when the face is invisible on the masked-face trials.
Indeed, responses to the spoken target words on the masked-face trials were significantly faster when the lip movements were congruent than incongruent, t(45) = 2.66, p < 0.05 (ts > 3.50, ps < 0.005).
To control for the possibility that participants might have failed to report face visibility on some of the masked-face trials, we performed a second experiment incorporating a more stringent indicator of face visibility. A tinted translucent ellipse was placed over the mouth region of the face. On each trial, after responding to the spoken word, participants were asked to report the color of the ellipse (red, blue, green, or yellow); critically, they were required to guess if they thought that they had not seen the face. All masked-face trials on which participants correctly reported the color (30%) were removed from analysis.
The same pattern of results was obtained. On the masked-face (face invisible) trials, responses to the spoken target words were significantly faster when the lip movements were congruent than incongruent, t(23) = 2.12, p < 0.05 (Figure 1D), with a mean accuracy of 92% and no evidence of a speed-accuracy trade-off. Also consistent with the original experiment, there was no congruency effect for target responses on the attention-enforcement trials, t(23) = 0.80, n.s. (Figure 1F), or for non-target responses (1777 ms [congruent] vs. 1743 ms [incongruent], t(23) = 1.06, n.s., on the masked-face trials, and 1708 ms [congruent] vs. 1779 ms [incongruent], t(23) = 1.97, n.s., on the attention-enforcement trials).
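The congruency effects reported above are paired comparisons of per-participant mean reaction times. As an illustrative sketch only (using simulated reaction times, not the study's data), the paired t statistic underlying these comparisons can be computed as follows:

```python
import math
import random

def paired_t(cong, incong):
    """Paired t-test on per-participant mean RTs (congruent vs. incongruent).

    Returns the t statistic and degrees of freedom (n - 1).
    """
    diffs = [i - c for c, i in zip(cong, incong)]  # positive = congruent faster
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1

# Simulated RTs in ms for 24 hypothetical participants; these values are
# illustrative and are not the data reported in the text.
random.seed(1)
congruent = [random.gauss(1500, 120) for _ in range(24)]
incongruent = [c + random.gauss(60, 80) for c in congruent]  # ~60 ms slowing

t, df = paired_t(congruent, incongruent)
print(f"t({df}) = {t:.2f}")
```

Because each participant contributes one congruent and one incongruent mean, the test is computed on the within-participant differences, giving n - 1 degrees of freedom (here 23 for 24 participants, matching the t(23) tests above).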
These results demonstrate that even when a speaking face is rendered invisible by a dynamic mask carrying strong motion signals, the visual system accurately encodes the invisible lip movements and thereby facilitates auditory perception of the corresponding spoken words. This crossmodal effect likely occurs at the level of word encoding: invisible lip movements have been shown not to generate a McGurk effect (Palmer & Ramsey, 2012), suggesting that they do not influence auditory perception at the level of syllable encoding.
Dorsal motion-processing mechanisms (e.g., V3a, V5) would have responded predominantly to the strong, visible flashing mask (e.g., Moutoussis et al., 2005). The invisible lip movements would thus likely have been processed through the ventral visual pathway, including the superior temporal sulcus (STS), an area that selectively responds to biological motion and movements of facial features (e.g., Allison et al., 2000; Calvert & Campbell, 2003; Grossman et al., 2000), and would have facilitated spoken word perception via multimodal portions of the STS (e.g., Calvert et al., 2000). Sophisticated unconscious processing of static images (e.g., words, faces, the sex of human bodies, and contextual congruence; Jiang et al., 2006; Jiang et al., 2007; Mudrik et al., 2011; Yang et al., 2007) has been demonstrated. Our results extend these prior findings to the processing of dynamic information. Static information can, in principle, be extracted from a dynamic mask by temporal averaging. However, unconscious extraction of the subtle dynamics of lip movements from the overwhelming random dynamics of the mask requires sophisticated tuning of the ventral visual system to behaviorally relevant dynamics.